■fa  in  ft  "wifca 


J/UJGOR 


}  1 


ELLIPTIC  EQUATION  SOLVERS 
FOR  NORDA  OCEAN  MODELS  AND  A  STUDY 
OF  EQUATORIAL  OCEAN  DYNAMICS 

J206-82-001/6204 


Alan  J.  Wallcraft 


Accession  For 

BTIS  ORA A I 

x 

DTIC  TAB 

□ 

Unannounced 

□ 

Justification _ 

BvtW  Izhc,  ftv 

a  CilO 

Prepared  for: 

Naval  Ocean  Research  and  Development  Activity 
N$TL  Station,  MS  39529 

Under: 

Contract  Number  N00014-81-C-0085 
June  2,  1982 


Distribution/ 
Availability  Codas 
(Avail  and/or 
)lst  Special' 


206  South  Whiting  Street 


DISTRIBUTION  STATEMENT  A 

Approved  for  public  leleasa) 
Distribution  Unlimited 

Alexandria,  Virginia  22304 


s 


DTIC 

ELECTE 
JUL  231982 


D 

(70S)  823-1300 


UNCLASSIFIED 


SECURITY  CLASSIFICATION  OF  THIS  FACE  fWh«n  Dote  Entotog) 


REPORT  DOCUMENTATION  PAGE 

READ  INSTRUCTIONS 
BEFORE  COMPLETING  FORM 

1.  REFORT  NUMBER  2.  OOVT  ACCESSION  NO. 

J206-82-001/6204  -0W  7  !>  7  ^ 

S.  RECIPIENT'S  CATALOG  NUMBER 

4.  TITLE  (mtd  Subtitle) 

ELLIPTIC  EQUATION  SOLVERS  FOR  NORDA  OCEAN  MODELS 
AND  A  STUDY  OF  EQUATORIAL  OCEAN  DYNAMICS 

S.  TVFE  OF  REFORT  *  FERIOO  COVERED 

Final  Report 

11/25/80  -  11/24/81 

S.  PERFORMING  ORO.  REPORT  NUMBER 

J206-82-001/6204 

7.  author?*; 

Dr.  Alan  J.  Wall craft 

s.  contract  or  grant  number?*; 

N00014-81 -C-0085 

S.  FERFORMINO  ORGANIZATION  NAME  AND  ADDRESS 

JAYCOR 

205  South  Whiting  Street 

Alexandria,  VA  22304 

10.  PROGRAM  ELEMENT.  PROJECT.  TASK 
AREA  •  WORK  UNIT  NUMBERS 

A002 

II.  CONTROLLING  OFFICE  NAME  AND  ADDRESS 

Naval  Ocean  Research  and  Development  Activity 

NSTL  Station,  MS  39529 

IS.  REPORT  OATE 

June  2,  1982 

39  pages 

14.  MONITORING  AGENCY  NAME  S  ADDRESS?!#  dlllotont  Inm  Controlling  Office) 

Naval  Ocean  Research  and  Development  Activity 
NSTL  Station,  MS  39529 

IS.  SECURITY  CLASS,  (ml  tklo  too  CM) 

UNCLASSIFIED 

is*,  decl assi  fi cation/  downgrading 

N  TTDULE 

II.  DisTmniiTinM  utatfmfiit  /«#  *htm  »mm* % 


DISTRIBUTION  STATEMEwf~A" 

Apptovod  fat  public  xolociMl 
Distribution  Unlimited 


17.  DISTRIBUTION  STATEMENT  ?•#  Oim  obmtroct  ontorod  In  Black  20.  II  dlfmrmmt  tram  kmoort) 


IS.  SUFFLEMEnTARV  NOTES 


IS.  key  WORDS  (Continue  on  rorormo  mlgo  II  nocoooc rr  eng  MuiMIJr  Sr  block  nimbmr) 


SO.  ABSTRACT  (Conttnuo  on  romoroo  d4o  II  nocooom r  *"4  Identify  by  block  mmkm) 


SECURITY 


UNCLASSIFIED 

MTV  CLASS!  FlCAT* 


•tow  OF  THIS  BABE  fH>—  Pats  SotaraO 


00  ,52T„  1473  EOITION  OF  I  NOV  OS  IS  OBSOLETE 
S/N  01M-LF-014-6601 


TABLE  OF  CONTENTS 

Page 


I.  ELLIPTIC  SOLVER  DEVELOPMENT  .  1 

Tridiagonal  Solvers  .  1 

Rectangular  Helmholtz  Solvers  .  5 

Non- rectangular  Elliptic  Solvers . 13 

Solvers  on  the  Cyber  203 .  23 

II.  OCEAN  MODEL  DEVELOPMENT  .  28 

III.  EQUATORIAL  DYNAMICS . 30 

IV.  REFERENCES . 38 


I.  &LIPTIC  SOLVER  DEVELOPMENT 

TRIDIAGONAL  SOLVERS 

Several  of  NORDA's  elliptic  solvers  calculate  the  solution  of  a 
number  of  trldlagonal  systems  as  part  of  their  total  solution  process. 

If  just  one  trldlagonal  system  Is  to  be  solved  on  a  vector 
machine  then  the  standard  algorithms,  variants  of  Gaussian  elimination, 
are  not  suitable  due  to  their  recursive  nature.  Probably  the  best 
method  under  these  circumstances  Is  scalar  cyclic  reduction 
(Swartztrauber,  1979),  which  has  a  higher  operation  count  than  Gaussian 
elimination  but  Is  vectorlzable  with  average  vector  length  N/Log2N  on 
systems  of  dimension  N.  Even  this  method  Is  only  totally  satisfactory 
for  systems  of  dimension  0(1,000),  and  hence  potentially  powerful 
elliptic  solvers  requiring  the  sequential  solution  of  many  trldlagonal 
systems  of  dimension  0(100)  (e.g.,  block  cyclic  reduction)  are  not 
viable  on  vector  processors.  However  the  trldlagonal  systems  In  NORDA's 
elliptic  solvers  can  be  solved  In  parallel.  Thus  Gaussian  elimination 
Is  vectorlzable  with  a  potential  vector  length  at  least  as  large  as  the 
number  of  parallel  systems,  and  on  vector  machines  with  only  Inner  loop 
vector 1z at ion  (e.g.,  CRAY-1,  Cyber  203/205),  It  Is  clearly  the  best 
method  under  these  circumstances  (Swarztrauber,  1980).  However  the 
situation  Is  less  clear  on  the  TIASC;  suppose  the  solution  of  M  systems, 
each  of  dimension  N,  Is  desired  on  a  two  pipe  TIASC,  then: 

e  Standard  Gaussian  elimination  has  a  vector  length  of  M/2. 

e  Gaussian  elimination  with  folding  has  vector  length  of  M. 

e  Scalar  cyclic  reduction  has  a  vector  length  of  M.  Log2N. 

The  reason  that  a  'non  standard'  version  of  Gaussian 
elimination  should  be  used  Is  that  the  TIASC  has  two  independent  vector 
pipes  but  standard  Gaussian  elimination  must  perform  Its  vector 
operations  In  strict  sequence.  The  effect  Is  that  each  vector  operation 
Is  shared  between  the  two  pipes,  giving  rise  to  two  vector  start-ups  per 
operation  (or  an  average  vector  length  of  M/2).  The  version  with 


folding  eliminates  off-diagonal  elements  from  the  top  and  bottom  of  each 
system  working  Into  the  center,  then  backsubstltutes  from  the  center  to 
the  top  and  bottom.  It  has  the  same  operation  count  as  the  standard 
version  but  two  Independent  sequences  of  vector  operations  (on  the  top 
and  bottom  halves  of  systems,  respectively)  can  now  be  performed  In 
parallel  on  the  two  pipes,  giving  rise  to  a  vector  length  throughout  of 
M.  This  algorithm  has  been  known  for  some  time;  It  Is  used,  for 
example.  In  the  UNPACK  package  (Dongarra,  et  al.,  1979)  to  reduce  the 
overheads  due  to  loop  control  logic  on  scalar  machines,  and  by  others 
(Temperton,  1980)  because  on  constant  coefficient  symmetric  systems  It 
requires  half  the  storage  of  the  standard  version.  Despite  the  fact 
that  It  Is  Ideally  suited  to  the  TIASC,  this  Is,  to  our  knowledge,  the 
first  Implementation  of  the  method  on  this  machine.  Two  classes  of 
trl diagonal  systems  need  to  be  considered: 

e  Constant  coefficient  symmetric  systems  -  for  uniform-uniform 

Helmholtz  solvers. 

e  Variable  coefficient  non -symmetric  systems  -  for  stretched- 

unlform  and  stretched -stretched  Helmholtz  solvers. 

In  both  cases  all  systems  are  diagonally  dominant  so  pivoting  for 
stability  Is  unnecessary. 

In  the  constant  coefficient  case,  scalar  cyclic  reduction  (with 
precomputed  coefficients)  Is  faster  than  all  variants  of  Gaussian 
elimination:  see  Graph  1.  It  usually  requires  less  storage  than  the 
latter  and  It  Is  hence  the  method  of  choice  for  this  class  of  systems. 
In  the  variable  coefficient  case,  cyclic  reduction  requires  a 
prohibitive  amount  of  storage  for  constants.  Moreover,  Graph  1  shows 
that  Gaussian  elimination  with  folding  using  one  mesh  of  storage  for 
constants  (l.e.,  the  diagonal  factors)  Is  very  fast,  and  It  Is  the 
method  chosen  for  these  systems.  Note  that  a  variant  using  two  meshes 
of  storage  would  be  slightly  faster  (but  the  difference  In  time  does  not 
offset  the  Increased  storage  cost)  and  that  a  variant  requiring  no 
storage  space  for  constants  Is  significantly  slower  (due  mainly  to  the 
cost  of  performing  divisions  on  the  TIASC).  The  latter  method  Is  still 
retained  as  an  option  In  the  stretched-stretched  Helmholtz  solver 
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because  the  trldlagonlal  solutions  account  for  only  a  small  percentage 
of  the  total  solution  time  In  large  problems. 


RECTANGULAR  HELMHOLTZ  SOLVERS 


NORDA's  finite  difference  Helmholtz  solvers  can  be 
characterized  by  four  parameters;  two  describing  the  boundary  conditions 
on  the  vertical  and  horizontal  sides,  respectively,  and  two  describing 
the  type  of  mesh  spacing  (uniform  or  stretched)  In  the  horizontal,  or  x, 
direction  and  the  vertical,  or  y,  direction,  respectively.  Solution 
methods  are  most  conveniently  grouped  according  to  mesh  spacing  type  and 
this  system  Is  used  below.  In  this  report  the  term  'Neumann  boundary 
conditions'  will  always  apply  to  the  staggered  grid  Neumann  boundary 
conditions  encountered  In  ocean  models  used  at  NORDA. 

Uniform-Uniform  Solvers 

These  solvers  are  all  Implementations  of  the  Fast  Fourier 
Transform  method  due  to  Hockney  (Hockney,  1970),  which  Is  most  commonly 
denoted  as  FACR(O).  Neumann-Neumann,  Dlrlchlet-Dlrlchlet,  Neumann- 
periodic  and  Dlrlchlet-perlodic  variants  are  available.  In  each  case 
the  horizontal  dimension  Is  arbitrary  but  the  vertical  (Internal) 
dimension  Is  restricted  to  2  x  2P  +  q,  or  3  x  2P  +  q,  or  5  x  2P  +  q 
where  the  largest  p  Is  (approximately)  7  and  q  Is  -1,  0,  or  +1  for 
Dlrlchlet,  Neumann  and  periodic  cases  respectively.  Solvers  with 
periodic  boundary  conditions  on  the  vertical  sides  are  not  Implemented 
because  of  the  lack  of  a  periodic  trldlagonal  solver.  Probably  the  only 
Important  example  of  this  type  Is  the  periodic-periodic  solver  and  this 
Is  not  currently  required  at  NORDA. 

The  uniform-uniform  solvers  are  the  fastest  and  most  storage 
efficient  available  at  NORDA  and  are  probably  the  most  heavily  used.  No 
significant  changes  have  been  made  to  the  TIASC  versions  of  these 
solvers  this  year. 

Stretched-Unlform  Solvers 

The  uniform-uniform  FACR(O)  method  described  above  can  also  be 
used  In  stretched-unlform  cases.  The  only  required  modification  to  the 
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solver  Is  the  replacement  of  a  symmetric  constant  coefficient 
trldlagonal  solver  with  a  non -symmetric  variable  coefficient  version. 
The  resulting  codes  are  very  nearly  as  fast  as  the  original  uniform  grid 
solvers  (less  than  five  percent  slower)  but  require  more  storage.  Note 
that  tiie  stretched  coordinate  Is  always  associated  with  Neumann  or 
Dlrlchlet  boundary  conditions. 

The  stretched-unlform  solvers  were  originally  supplied  with  a 
standard  Gaussian  elimination  trldlagonal  solver  requiring  two  meshes  of 
storage  for  constants.  This  has  been  replaced  by  a  non-standard  version 
of  Gaussian  elimination  which  Is  faster  (on  the  TIASC)  and  requires  only 
one  mesh  of  storage  for  constants.  Stretched-unlform  solvers  should 
probably  see  more  use  at  NORDA  than  they  have  up  to  now. 

Stretched-Stretched  Solvers 

The  lack  of  a  good  stretched-stretched  solver  was  seen  as  a 
major  shortcoming  of  NORDA' s  program  collection  and  two  new 
Implementations  of  very  different  methods  have  been  produced  to  fill 
this  gap.  The  only  stretched  solver  previously  available  at  NORDA  was  a 
generalized  block  cyclic  reduction  code  (Swarztrauber  and  Sweet,  1975); 
however,  this  Is  not  suitable  for  use  on  vector  machines.  The  obvious 
alternative  Is  matrix  eigenvalue  decomposition  (MED)  (Famel  1,1980),  but 
this  has  a  larger  asymptotic  operation  count  and  has  been  found, 

experimentally,  to  be  slower  than  cyclic  reduction  (In  FORTRAN  on  scalar 
machines)  over  the  range  of  practical  grid  sizes. 

The  major  expense  In  the  solution  phase  of  MED  Is  two  matrix- 
matrix  multiplications,  each  with  a  scalar  operation  count  of  2  NX2.  NY 
on  an  NX  by  NY  mesh.  But  on  the  TIASC  the  operation  count  can  be 
halved,  to  NX2. NY,  fay  utilizing  this  machine's  fast  dot  product 

capability,  and  thus  almost  halving  the  total  solve  time.  It  should  be 

noted  that  similar  relative  speeds  are  obtainable  on  most  vector 

processors  (e.g.,  TIASC,  CRAY-1,  Cyber  205)  and  many  scalar  machines  (by 
using  an  optimized  machine  code  routine),  but  not  on  the  Cyber  203.  NED 
still  has  an  operation  count  of  0(NX2.NY)  compared  to  0(NX.NY.  log  NY) 
for  the  FACR(O)  routine  used  on  stretched-unlform  problems.  However  on 
the  TIASC  It  Is  only  three  times  slower  at  NX  *  64,  which  Is  acceptable 


for  a  stretched- stretched  solver  (generalized  block  cyclic  reduction  Is 
about  four  times  slower  than  FACR(O)  on  scalar  computers). 

Noteworthy  features  of  the  MED  implementation  at  NORDA  Include: 

•  The  required  elgensystem  Is  calculated  using  REAL*8  arithmetic 
In  a  preprocessing  stage,  but  may  be  stored  and  used  in  the 
solution  phase  In  REALM  where  appropriate. 

•  NX  must  be  no  greater  than  NY,  for  efficiency  of  storage  and 
time.  There  are  no  other  restrictions  on  the  size  of  NX  or  NY. 

•  Two  versions  of  the  required  tridiagonal  solver  are  supplied: 
the  faster  version  requires  one  mesh  of  storage  for  constants, 
the  slower  requires  no  storage  and  gives  rise  to  just  six 
percent  increase  In  total  solve  time  at  NX  *  100.  The  latter 
version  may  be  preferred  when  storage  Is  the  critical  resource. 

•  Total  constant  storage  requirements  are  2  NX2  or  2  NX2  +  NX.  NY 
words,  depending  on  the  tridiagonal  solver  used. 

•  The  solver  has  a  similar  calling  sequence  to  the  FACR(O) 
solvers. 

The  alternative  stretched-stretched  solver  uses  a  staballzed 
marching  method.  This  solver  Is  not  usually  competitive  with  MED 
because  It  has  a  very  high  storage  requirement  and  must  be  run  In 
REALM.  If  REALM  accuracy  Is  sufficient,  MED  Is  faster  for  all 
practical  grid  sizes.  Marching  can  be  applied  to  more  general  problems 
than  Helmholtz's  equation  on  a  rectangle  and  so  the  method  will  be 
discussed  more  fully  In  a  latter  section. 

Rectangular  Solver  Statistics 

Storage  and  time  per  node  statistics  are  given.  In  Table  1  and 
Graph  2,  for  various  Neumann-Neumann  solvers  on  square  regions  (l.e.,  NX 
■  NY).  A  rectangle  with  NX  ■  2  NY  might  be  more  typical  of  those 
encountered  In  practice  but  timing  Information  Is  twice  as  expensive  to 
obtain  In  this  case.  Graph  2  Is  approximately  valid  for  general 
rectangular  regions  provided  the  smaller  of  NX  and  NY  Is  used  on  the 
bottom  axis  for  the  stretched-stretched  solvers  and  NY  Is  used  for  the 
FACR(O)  solvers.  Table  1  also  carries  over  to  rectangles  except  that 
MED  values  (NNSSH.D)  represent  upper  limits,  for  example  If  NY  ■  2  NX 
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Table  1:  RECTANGULAR  SOLVER  STORAGE  REQUIREMENTS  (ON  TIASC) 


NX  =  NY 

STORAGE  (REAL*4  MESHES) 

SOLVER 

16 

32 

64 

128 

NNUUH.F 

1.2 

0.8 

0.5 

0.3 

NNSUH.F 

1 

1 

1 

1 

NNSSH.D 

3 

3 

3 

3 

2 

2 

2 

2 

MMSSH.M 

8 

16 

32 

60 

8 

12 

20 

40 

NNUUH.F  -  a  Neumann-Neumann  uniform-uniform 
FACR(O)  Helmholtz  solver. 

NNSUH.F  -  a  Neumann-Neumann  stretched-uniform 
FACR(O)  Helmholtz  solver. 

NNSSH.D  -  a  Neumann-Neumann  stretched-stretched 
MED  Helmholtz  solver. 

MMSSH.M  -  a  mixed-mixed  stretched-stretched 

staballzed  marching  Helmholtz  solver. 

Note:  Where  appropriate  the  storage  requirement  of  the  fastest 
mode  Is  given  above  the  requirement  of  the  minimum  storage 
mode. 


Internal  nodes  =  NX* ) 


GRAPH  2 

1)  NNUUH.F  for  NX  -  8,  10,  12,  16,  20,  24,  32,  40,  48,  64,  80,  and  96. 

2)  NNSSH.O  fastest  mode. 

3)  NNSSH.D  storage  efficient  mode 

4)  MMSSH.M  fastest  mode  (no  relaxation  sweeps). 

5)  MMSSH.M  storage  efficient  mode 

Note: 

a)  Times  for  NNSUH.F  (not  shown)  are  within  five  percent  of 
those  for  NNUUH.F 

b)  NNUUH.F,  NNSSH.D  times  are  for  REALM  calculations  on  the 
TIASC. 

c)  MMSSH.M  times  are  for  REALM  calculations  on  the  TIASC  with 
solution  accurate  to  five  significant  decimal  digits. 

d)  Preprocessing  time  Is  not  Included,  this  can  be  large  for 
NNSSH.D  and  MMSSH.M. 
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this  solver's  storage  requirements  Mould  be  2  or  1  meshes  Instead  of  3 
or  2  meshes  respectively. 

Storage  requirements  of  all  the  solvers  except  the  marching 
code  are  reasonable,  although  the  amount  of  storage  required  Increases 
with  the  complexity  of  the  grid.  The  marching  solver  has  a  very  large 
storage  requirement,  for  example  sixty  128  x  128  meshes  Is  960  K  words 
and  the  overhead  Increases  relative  to  the  other  solvers  with  Increasing 
rectangle  size.  This  disadvantage  Is  accentuated  on  machines  such  as 
the  TIASC  because  the  marching  solver  must  always  run  In  REAL*8.  If 
double  precision  accuracy  were  required  In  the  solution  then  the  storage 
requirements  of  all  the  other  solvers  would  double  but  that  of  the 
marching  code  would  Increase  only  slightly.  However  martlx  eigenvalue 
decomposition  still  has  a  significant  storage  advantage  even  In  REAL*8. 

As  expected  from  operation  counts  the  FACR(O)  codes  are  the 
fastest.  But  note  that  the  rectangle  (actually  NY  only)  Is  restricted 
to  fixed  sizes  and  that  the  5  x  2P  solver  Is  slightly  slower  than  the  3 
x  2P  and  4  x  2P  variants.  Time  per  node  for  these  codes  decreases  with 
Increasing  square  size,  contrary  to  expectation,  because  of 
vector! zatl on  effects,  which  also  account  for  the  sharp  Increase  In 
solve  time  on  small  rectangles. 

Time  per  node  for  the  MED  solver  shows  a  linear  Increase  with 
NX  as  expected  from  Its  operation  count.  It  Is  always  faster  than  the 
marching  solver  on  practical  grid  sizes  and  only  takes  four  times  as 
long  as  FACR(O)  even  at  NX  *  100.  Vector! zable  effects  are  less 
Important  than  with  FACR(O)  and  MED  Is  the  fastest  solver  available  at 
NORDA  for  very  small  rectangles.  The  storage  efficient  version  takes 
relatively  little  extra  time  and  may  therefore  be  preferred  In  some 
cases. 

The  graph  lines  for  the  marching  solver  exhibit  a 
characteristic  sawtooth  shape.  This  Is  because  the  total  time  Includes 
a  component  which  depends  only  on  the  number  of  sub -blocks  used.  The 
time  per  node  jumps  whenever  the  size  of  the  rectangle  Is  such  that 
another  block  must  be  added  to  obtain  the  required  accuracy.  The  time 
per  node  also  exhibits  a  linear  trend  with  NX,  but  with  a  far  shallower 
slope  than  the  MED  solver.  The  marching  code  is  certainly  fast  enough 
to  be  a  practical  stretched -stretched  solver,  but  on  rectangular 


regions  MED  always  requires  far  less  storage  and  It  Is  also  faster  for 
practical  grid  sizes  provided  REAL*4  accuracy  Is  sufficient.  In  storage 
efficient  mode  the  marching  solver  requires  about  two-thirds  of  the 
storage  as  In  the  standard  mode,  but  takes  almost  twice  as  long  per 
solution  and  Is  therefore  Impractical  In  this  context.  Marching  solvers 
are  not  limited  to  Helmholtz's  equation  and  are  almost  certainly  the 
fastest  known  general  elliptic  PDE  solvers.  They  are  also  viable 
Helmholtz  solvers  for  non-rectangular  regions,  particularly  on 
stretched-stretched  grids. 
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NON-RECTANGULAR  ELLIPTIC  SOLVERS 

Rectangular  Helmholtz  solvers  are  the  most  efficient 
elliptic  POE  solvers  available.  For  this  reason,  among  others, 
rectangular  ocean  basins  are  frequently  used  at  NORDA.  However 
realistic  models  of  actual  ocean  basins  do  not  give  rise  to  rectangular 
regions,  and  so  a  need  exists  for  non-rectangular  Helmholtz  solvers. 
Furthermore  one  of  the  layered  ocean  models  occasionally  used  at  NORDA 
can  Involve  streamfunctlon  equations  with  a  more  general  elliptic  POE. 
So  a  need  exists  for  a  non-rectangular  general  POE  solver.  This  would 
also  be  useful  If  a  non-separable  mesh,  such  as  that  from  a  stereo¬ 
graphic  projection  onto  the  sphere,  were  to  be  used  In  an  ocean  model. 

Non-Rectangular  Helmholtz  Solvers 

Probably  the  fastest  method  for  this  case  Is  the 
capacitance  (or  capacity)  matrix  technique  CMT,  although  the  marching 
methods  discussed  In  the  next  section  can  also  be  competitive, 
particularly  on  stretched  grids. 

The  CMT  Involves  solving  a  related  problem  In  an  enclosing 
rectangle  (using  say  FACR(O)),  calculating  a  correction  via  a  relatively 
small  dense  capacitance  matrix  equation,  then  solving  the  corrected 
rectangular  system  (using  the  same  rectangular  solver).  Two  variants  of 
this  method  are  available  at  NORDA: 
e  The  Direct  CMT. 
t  The  Preconditioned  CMT. 

The  Direct  CMT  Is  the  best  known  variant  (Buzbee  et  al., 
1971),  It  explicitly  uses  the  precalculated  and  stored  capacitance 
matrix  to  generate  the  correction  factors.  This  Is  very  fast  but  the 
storage  overhead  for  the  capacitance  matrix  can  be  very  large  (between  0 
and  25  meshes  with  9  meshes  being  typical).  The  preconditioned  CMT 
(Wallcraft,  1980),  exploits  the  relationship  between  the  finite 
difference  Helmholtz  equation  on  the  rectangle  and  the  capacitance 
matrix  to  compute  the  correction  factors  Iteratively  without  storing  the 
capacitance  matrix.  Because  each  Iteration  requires  a  rectangular 
solve,  the  already  rapid  convergence  Is  accelerated  by  preconditioning 
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with  a  stored  periodic  banded  approximate  Inverse  to  the  capacitance 
matrix.  Solution  time  depends  most  critically  on  the  size  of  the 
Helmholtz  coefficient  and  the  required  accuracy.  With  a  good  Initial 
approximation  to  the  solution,  times  typically  range  from  slightly 
faster  than  the  Direct  CUT  to  four  times  this  value.  In  general  the 
direct  variant  should  be  used  If  sufficient  storage  Is  available  but  the 
Preconditioned  CMT  provides  a  viable  alternative,  particularly  for 
equations  with  a  large  Helmholtz  coefficient. 

Solvers  are  available  at  NORDA  for  Dlrlchlet  or  Neumann 
problems  on  uniform  grids.  Versions  for  stretched -uni form  or  stretched- 
stretched  grids  could  easily  be  generated.  However  the  latter  would  use 
matrix  eigenvalue  decomposition  as  a  rectangular  solver  and  would 
probably  not  be  competitive  with  a  marching  solver.  The  package  has 
been  written  in  such  a  way  that  It  Is  easy  to  change  from  direct  to 
preconditioned  variants  as  required.  The  latest  versions  Incorporate  a 
much  proved  user  Interface  and  full  Internal  documentation. 

Results  are  presented  from  a  test  of  the  CMT  packages  on 
the  standard  90  x  50  Gulf  of  Mexico  geometry  with  tubes  at  Inflow  and 
outflow  (Hurlburt  and  Thompson,  1980)  with  either  a  one  layer  seml- 
Impllclt  free  surface  primitive  equation  model  (Neumann  boundary 
conditions)  or  a  one  layer  quasi -geostrophlc  model  (Dlrlchlet  boundary 
conditions),  together  with  results  from  a  slightly  more  realistic 
geometry  for  the  Gulf  with  the  semi -Implicit  model  only.  The  times  are 
for  2,000  tlmesteps  each  representing  90  minutes  model  time  for  a  total 
simulation  of  125  days  (  a  practical  model  simulation  would  be  for  about 
5  years).  All  times  are  In  seconds  on  the  two-pipe  TIASC. 

The  storage  requirements  depend  on  the  number  of  boundary 
nodes  within  the  enclosing  rectangle.  In  terms  of  meshes,  the  storage 
requirement  of  the  direct  CMT  depends  only  on  the  region  geometry  (not 
on  mesh  size),  and  that  of  the  PCMT  Is  typically  one  mesh  or  less.  The 
reduced  gravity  experiment  (g  ■  .02)  with  the  semi -Implicit  model  gives 
rise  to  a  large  Helmholtz  coefficient  and  therefore  the  preconditioned 
variant  is  faster  than  on  the  barotrophlc  experiment  (g  *  9.80).  Indeed 
PCMT  Is  almost  as  fast  as  DCMT  In  reduced  gravity  cases.  The  quasi - 
geostrophlc  streamf unction  calculation  involves  Poisson's  equation  and 
therefore  has  times  very  similar  to  the  barotrophlc  semi -Implicit 
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i: 

1.  model.  But  note  that  the  standard  Iterative  Dlrlchlet  CMT  (bandwidth  ■ 

0)  Is  considerably  slower  than  the  preconditioned  variant  (bandwidth  > 
j  0),  In  contrast  with  the  Neumann  case  In  which  the  underlying  Iterative 

variant  Is  very  fast.  Solution  times  for  PCMT  can  vary  substantially  on 
different  geometries,  as  the  two  Neumann  examples  Illustrate,  but  Is 
relatively  Insensitive  to  mesh  size,  e.g.,  times  for  a  10  km  grid  should 
be  very  similar  (relative  to  the  DCMT)  to  those  on  the  20  km  grid  shown 
here.  The  preprocessing  times  are  larger  than  might  be  expected  because 
the  matrix  Inversion  routine  used  has  poor  vectorlzatlon  properties. 
But  they  Illustrate  that  the  PCMT  Is  comparable  to  the  direct  variant  In 
the  area. 

REGIONS 


G.O.M.  (1)  90  X  50  Mesh 
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TABLE  3:  G.O.M.  (1)  -  DIRICHLET  CASE 


Solve  Total  %  of 
Method  Band-  Storage  time  time  OCMT 
width  (meshes)  (sec)  (sec)  total 


(128) 


3.9  40.0  92.2  100% 


101.5 
112.9 
129.3 

346.6 


153.7 
165.1 
181.5 

398.7 


Preprocessing 

time 

(sec) 


TABLE  4:  G.O.M.  (2)  -  NEUMANN  CASE 


9 

Method 

Band¬ 

width 

Storage 

(meshes) 

Solve 

time 

(sec) 

Total 

time 

(sec) 

%  of 
DCMT 
total 

Preprocessing 

time 

(sec) 

- 

DCMT 

(178) 

7.5 

45.1 

93.1 

loot 

34.0 

PCMT 

27 

1.1 

48.7 

96.7 

104t 

26.3 

27 

1.1 

149.1 

197.1 

212t 

25.8 

PCMT 

21 

0.9 

165.4 

213.4 

230% 

14.8 

9.80 

PCMT 

11 

0.5 

177.1 

225.1 

242% 

5.1 

9 

0.0 

185.7 

233.7 

251% 

1.9 

Non -Rectangular  General  Elliptic  POE  Solvers 

A  variant  of  the  stabilized  marching  method  Is  the  only  fully 
documented  and  tested  solver  In  this  class  at  NOROA.  The  algorithm  used 
was  originally  presented  for  rectangular  regions  (Madala,  1978  ),  but 
NOROA' s  most  recent  implementations  can  handle  general  bounded  regions 
with  Neumann  or  Dlrlchlet  boundary  conditions.  The  original  Irregular 
geometry  versions  were  delivered  to  NOROA  by  JAYCOR  under  ONR  Contract 
Number  N00014-80-C-0298  (Dltrlch,  1981).  The  solvers  now  available  at 
NOROA  Incorporate  minor  corrections  to  the  program  logic  together  with 
more  extensive  modifications  which  include: 

e  Recoding,  in  FORTRAN,  to  allow  optimal  vectorlzatlon  on  the 
TIASC. 

e  Replacement  of  a  "bitmap"  region  representation  with  a  line 
based  representation. 

e  Merging  versions  with  and  without  the  capability  for  relaxation 
sweeps. 

•  Production  of  new  versions  for  Helmholtz's  equation  on 
stretched -stretched  and  uniform-uniform  meshes. 

Stabilized  marching  algorithms  are  made  up  of  two  sub- 
algorithms,  marching  over  sub -blocks  containing  a  small  number  of  lines 
(typically  7-15)  and  calculating  corrections  at  sub -block  boundaries  to 
ensure  that  the  equation  Is  satisfied  everywhere.  Stabilized  marching 
algorithms  differ  In  how  the  latter  stage  Is  performed,  but  In  this 
variant  the  correction  phase  Involves  two  matrix-vector  multiplies  at 
each  sub-block  boundary.  Both  the  marching  and  matrix  multiply  stages 
are  vectorlzable  but  operations  applied  to  block  boundaries  may  only  run 
In  scalar  mode.  This  Is  because  the  sub-block  boundary  nodes  are  of  two 
types,  nodes  dividing  two  sub -blocks  and  nodes  within  the  sub -block  but 
adjacent  to  the  (irregular)  region  boundary.  The  natural  way  to 
represent  such  a  boundary  Is  to  keep  a  list  of  the  mesh  coordinates  of 
each  boundary  node  and  It  Is  this  representation  which  gives  rise  to 
scalar  boundary  operations.  The  vectorlzable  alternative  Is  to 
explicitly  treat  tire  boundaries  In  two  parts,  nodes  on  the  line  dividing 


the  sub-blocks  are  treated  as  ( vector Izable)  lines  of  nodes  and  nodes  In 
the  Interior  of  the  sub-block  are  listed  as  before.  A  large  fraction  of 
the  nodes  (perhaps  90  percent)  tend  to  be  In  the  first  group  al lowing 
significant  vectorlzatlon  of  boundary  operations.  If  the  region  is 
rectangular  there  are  no  nodes  in  the  second  group  so  boundary 
operations  are  fully  vector Izable.  As  an  example  of  the  difference  this 
modification  can  make,  the  original  solver  takes  145  percent  of  the  time 
for  the  fully  vectorized  code  on  a  typical  problem  over  a  38  by  78 
rectangle. 

The  original  code  came  with  different  versions  for  three  region 
types:  rectangular,  non-rectangular  without  Islands,  and  non-rectangul ar 
with  Islands.  The  need  for  a  separate  code  for  rectangles  was  removed 
by  the  above  modification  to  the  sub-block  boundary  data  structure.  A 
separate  code  for  cases  with  Islands  was  still  required  because  a  bitmap 
(actually  a  REAL  array)  was  used  In  such  cases  to  prevent  the  marching 
process  from  acting  on  Island  nodes.  This  data  structure  Is  again  the 
most  obvious  but  It  costs  several  extra  floating  point  operations  per 
mesh  node  in  each  marching  phase  and  an  extra  mesh  of  storage.  The 
alternative  now  used  Is  to  list  the  beginning  and  end  of  each  strip  of 
sea  within  each  mesh  line.  The  marching  process  can  now  be  controlled 
by  program  logic  which  only  applies  It  to  those  nodes  on  which  It  Is 
appropriate.  The  Index  lists  are  relatively  compact  so  this  data 
structure  saves  both  storage  and  time.  On  a  38  x  78  non-rectangular 
region  with  one  small  Island  and  about  50  percent  of  mesh  nodes  on  land 
a  code  using  a  bitmap  took  120  percent  of  the  time  for  the  current 
version. 

As  was  shown  In  Graph  2  marching  solvers  are  very  fast  and 
versions  have  therefore  been  produced  to  solve  Helmholtz's  equation  over 
non-rectangular  regions  on  uniform-uniform  or  stretched-stretched 
grids.  These  versions  are  required  because  four  of  the  five  mesh  arrays 
needed  for  PDE  coefficients  In  the  general  case  are  not  required  for 
Helmholtz's  equation.  The  uniform  grid  variant  Is  sllghlty  faster  than 
the  general  solver,  and  the  stretched  grid  version  slightly  slower. 
There  Is  approximately  a  20  percent  difference  In  speed  between  the 
stretched  solver  (which  was  used  In  Graph  2)  and  the  uniform  code. 


Marching  solvers  and  the  direct  CMT  have  very  different 
properties,  some  of  which  are  outlined  below. 

e  The  DCMT  Is  a  direct  Method  which  Is  generally  as  accurate  as 
the  underlying  rectangular  solver.  The  stabilized  Marching 
method  Is  also  direct  but  relies  on  the  controlled  loss  of 
precision.  For  this  reason  It  Must  always  run  In  REAL*8  on  IBM 
stylo  Machines  (such  as  the  TIASC). 

e  The  storage  requirement  of  the  DCMT  depends  on  the  region 
geometry  but  usually  lies  between  0  and  25  meshes  with  9  meshes 
being  typical.  These  figures  can  be  halved  for  Dlrlchlet 
problems  and  In  all  cases  storage  requirement  (In  meshes)  Is 
approximately  Independent  of  grid  length.  The  storage 
requirement  of  the  marching  solvers  Is  more  closely  related  to 
region  area  than  to  region  shape  and  Increases  In  terms  of 
meshes  proportionally  to  the  square  root  of  the  number  of  nodes 
In  the  region.  Therefore  there  will  be  a  grid  length  below 
which  DCMT  will  require  less  storage  than  the  marching 
solvers.  However  even  If  the  marching  code  must  run  In  REAL *8 
It  can  require  less  storage  than  the  direct  CMT  on  some  regions 
with  practical  grid  lengths. 

e  The  solution  time  of  the  DCMT  depends  on  the  size  of  the 
enclosing  rectangle  and  lies  between  2  and  3  times  that  of  the 
corresponding  rectangular  solver,  l.e.,  FACR(O).  Graph  2 
Indicated  that  marching  solver  times  lie  between  3  and  4  times 
that  of  FACR(O)  on  the  TIASC  when  the  latter  uses  REALM 
arithmetic.  But  the  marching  process  Is  only  applied  to  nodes 
Inside  the  required  region  so  the  rectangular  case  times  must 
be  reduced  appropriately.  Marching  solvers  can  be  faster  than 
the  DCMT  when  a  large  fraction  of  the  enclosing  rectangle  lies 
outside  the  Irregular  region.  This  can  occur  either  because  of 
the  Intrinsic  region  shape  or  because  the  enclosing  rectangle 
for  the  CMT  Is  enlarged  to  conform  to  the  FACR(O)  mesh 
dimension  constraints. 

Numerical  results  are  presented  for  non -rectangular  solvers  on 
an  Idealized  representation  of  the  Mediterranean  Sea  for  two  grid 
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sizes.  The  equation  solved  was  Poisson's  equation  with  Dlrlchlet 
boundary  conditions.  This  region  Is  highly  favorable  to  the  marching 
solver,  which  Is  slightly  faster  than  the  DCMT  in  both  cases  and  also 
requires  less  storage.  The  enclosing  rectangle  used  by  the  CMT  solvers 
is  larger  than  that  used  by  the  marching  solvers  because  it  must  comply 
with  FACR(O)  constraints.  The  storage  for  the  DCMT  could  be  halved  In 
this  case  since  Dlrlchlet  boundaries  are  Involved.  The  storage 
requirements  of  the  marching  solvers  increase  faster  than  those  of  the 
DCMT,  and  the  general  elliptic  PDE  version  requires  significantly  more 
storage  than  the  uniform  solver.  The  PCMT  requires  relatively  little 
storage  but  is  significantly  slower,  relative  to  the  DCMT,  than  in 
previous  examples.  This  is  because  the  constriction  at  the  ankle  of 
Italy  causes  the  banded  preconditioning  matrix  to  be  a  poor 

approximation  to  the  inverse  of  the  capacitance  matrix. 

The  uniform  Helmholtz  marching  solver  can  be  competitive  with 
the  direct  CMT,  particularly  if  REAL*8  accuracy  is  required,  but  the 
latter  is  probably  preferable  overall  where  applicable.  The  stretched 
Helmholtz  marching  code  is  certainly  the  fastest  available  solver  for 
non-rectangular  problems,  and  has  no  competitors  at  NORDA.  The  general 
PDE  solver  is  very  fast  but  has  an  even  larger  storage  requirement  than 
the  Helmholtz  variant.  The  capability  for  relaxation  sweeps  can  reduce 
storage  overheads  slightly,  but  a  need  exists  for  a  more  storage 
efficient  solver,  comparable  In  speed  and  storage  requirements  to  the 
PCMT  for  these  problems. 
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TABLE  5:  80  x  40  MEDITERRANEAN  REGION  SOLVERS 


Method 

Enclosing 

Rectangle 

Time 

(sec) 

i  of  KMT  ” 
time 

Storage 

(words) 

'""KMT 

81  x  49 

.016 

loo* 

““  30R 

MMUUH.M 

81  x  41 

.014 

88* 

18K 

MMSSG.M 

81  x  41 

.016 

100* 

44K 

PCMT 

81  x  49 

.063 

394* 

8K 

PCMT 

81  x49 

.084 

525* 

3K 

MMUUH.M  -  uniform  grid  Helmholtz  equation  stabilized  marching  solver. 
MMSSG.M  -  general  elliptic  POE  stabilized  marching  solver. 


TABLE  6:  160  x  80  MEDITERRANEAN  REGION  SOLVERS 


Method 

Enclosing 

Rectangle 

Time 

*  of  DCMT 
Time 

Storage 

(words) 

DCMT 

161  x  97 

.064 

100* 

138K 

MMUUH.M 

161  x  81 

.062 

97* 

114K 

MMSSG.M 

161  x  81 

.071 

111* 

215K 

PCMT 

161  x  97 

.270 

422* 

22K 

PCMT 

161  x  97 

.380 

594* 

14K 

SOLVERS  ON  THE  CYBER  203 


Access  to  Fleet  Numerical's  Cyber  203  has  been  very  limited  to 
date  and  hence  less  work  than  planned  has  been  performed  on  this 
machine. 

The  TIASC  and  Cyber  203  are  both  vector  processors;  that  Is, 
full  machine  speed  Is  obtainable  only  when  performing  simple  operations 
on  large  regularly  ordered  structures,  but  they  are  realizations  of 
vastly  different  design  philosophies.  The  TIASC  Is  supplied  with  a  very 
basic  set  of  vector  operations  and  Its  power  lies  In  the  generality  of 
its  definition  of  what  constitutes  a  vectorlzable  data  structure. 
Almost  all  (non -recursive,  unconditionally  executed)  floating  point  or 
Integer  operations  acting  on  FORTRAN  arrays  with  subscripts  linearly 
dependent  on  up  to  three  loop  control  variables  are  theoretically 
equivalent  to  performing  just  one  TIASC  vector  operation.  Furthermore 
the  TIASC' s  FORTRAN  compiler  will  recognize  such  operations  In  standard 
FORTRAN  and  generate  the  appropriate  vector  code.  The  Cyber  203,  on  the 
other  hand,  has  a  very  rich  set  of  operations  but  a  very  simple 
definition  of  vectors,  essentially  a  vector  Is  a  one  dimensional  FORTRAN 
array. 

On  paper  the  Cyber  203  Is  the  more  flexible  machine,  that  Is 
any  vector  operation  on  the  TIASC  can  be  simulated  by  a  sequence  of  one 
or  more  vector  operations  on  the  Cyber  203  but  there  2 re  sever#', 
(useful)  vector  operations  on  the  Cyber  203  which  are  equivalent  to 
scalar  code  on  the  TIASC.  However  some  vector  operations  on  the  Cyber 
203  actually  run  at  scalar,  or  even  slower  speed  (due  apparently  to 
shortcomings  In  the  vector  hardware  design,  which  is  that  of  the  STAR- 
100)  and  these  Include  data  motion  operations  which  must  be  heavily  used 
to  compensate  for  the  simplicity  of  the  vector  definition.  Futhermore 
the  FORTRAN  compiler  on  the  Cyber  203  only  vectorizes  nonrecursive 
unconditionally  executed  floating  point  or  Integer  operations  acting  on 
FORTRAN  arrays  and  loop  Indices  which  can  be  trivially  transformed  Into 
one  Index  acting  contiguously  on  equivalent  one  dimensional  FORTRAN 
arrays  (l.e..  It  has  an  Inner  loop  vectorlzer  with  subscript  strength 
reduction).  The  Cyber  203  Is  therefore  the  less  flexible  machine  In 
standard  FORTRAN.  The  vector  hardware  problems  of  the  Cyber  203  have 
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been  resolved  in  Its  successor,  the  Cyber  205.  It  Is  to  be  hoped  that 
this  will  motivate  CDC  to  produce  a  good  vectorizing  FORTRAN  compiler 
for  the  Cyber  200  series. 

The  two -pipe  TIASC  has  a  maximum  REAL *4  vector  speed  of  25  M 
flops  and  a  relatively  long  vector  start  up  time  (half  machine  speed 
obtained  on  vectors  of  length  90),  the  Cyber  203  has  a  maximum  mean 
REAL*8  vector  speed  of  37  M  flops  (50  for  adds  but  only  25  M  flops  for 
multiplies)  and  an  even  larger  vector  start  up  time  (half  machine  speed 
on  vectors  of  length  150),  It  also  has  REALM  vector  capability  with 
maximum  speed  of  100  M  flops  but  this  Is  not  currently  available  In 
FORTRAN.  The  long  start  up  times  are  not  a  problem  for  the  explicit 
portion  of  three  dimensional  layered  ocean  codes  since  they  have  an 
Intrinsic  vector  length  of  0(10,000)  and  execute.  In  FORTRAN,  at  more 
than  90  percent  of  maximum  machine  speed  on  either  computer.  But  It  Is 
a  major  factor  In  elliptic  POE  solvers,  which  tend  to  operate  on  the  two 
dimensional  mesh  line  by  line,  l.e.,  act  on  vectors  of  length  0(100). 
This  problem  Is  more  serious  on  the  Cyber  203  for  two  reasons: 

•  It  has  the  longer  vector  start  up  times 

•  Its  FORTRAN  vectorl zer  Is  significantly  less  sophisticated  than 
that  of  the  TIASC. 

Only  one  solver  (the  Neumann -Neumann  uniform-uniform  Helmholtz 
solver)  has  so  far  been  transferred  to  the  Cyber  203  and  the  work 

Involved  Illustrates  clearly  the  problems  with  the  FORTRAN  compiler.  On 
the  TIASC  the  solver  has  the  form: 

•  Forward  fast  Fourier  transform  (FFT)  and  transpose. 

t  Tridiagonal  solution  by  cyclic  reduction. 

t  Transpose  and  reverse  FFT. 

Suppose  the  grid  contains  NX  by  NY  internal  mesh  nodes.  On  the 
TIASC  the  NX  forward  and  reverse  FFTs  are  performed  in  parallel  and  each 
Individual  FFT  Is  vectorlzable  with  average  vector  length  0(NY/log2 

NY).  Therefore  the  FFT  routines  (written  In  FORTRAN)  are  vectorlzable 
with  average  vector  length  0{NX.NY/log2  NY).  The  FFT  routine  Includes  a 
mesh  transpose  as  part  of  the  unscrambling  or  scrambling  phase  to 

facilitate  optimal  vectorlzatlon  of  the  tridiagonal  solution  phase.  The 
NY  trl diagonal  solutions  are  computed  In  parallel  using  cyclic 

reduction,  which  has  an  average  vector  length  of  NX/log2  NX  on 


Individual  systems,  for  a  total  vector  length  of  NY.NX/log2  NX.  Thus  In 
all  phases  of  the  algorithm,  vector  lengths  significantly  longer  than 
the  NX  or  NY,  Indicated  by  parallelism,  are  obtained.  This  is  a  direct 
result  of  the  TIASC  FORTRAN  capability  for  vectorizing  nested  loops. 

On  the  Cyber  203  the  modified  solver  has  the  form: 

•  Forward  FFT 

•  Transpose 

•  Tridiagonal  solution  by  Gaussian  elimination 

•  Transpose 

•  Reverse  FFT 

With  a  few  minor  exceptions  an  individual  FFT  Is  not 
vectorlzable  on  the  Cyber  203  (because  the  potential  vector  elements  are 
not  contiguous  In  memory)  and  hence  the  FFT  phases  have  average  vector 
lengths  of  about  NX.  The  mesh  transposes  can  no  longer  be  overlayed 
with  the  unscrambling  process  In  the  FFT  because  the  FORTRAN  compiler 
does  not  recognize  the  corresponding  code  as  vectorlzable.  A  separate 
routine  containing  both  the  FORTRAN  code  (for  use  on  other  machines)  and 
a  call  to  a  machine  language  routine  which  uses  the  Cyber  203's  fast 
transpose  capability  has  been  Introduced.  A  single  cyclic  reduction 
solution  Is  not  vectorlzable  so  the  tridiagonal  solver  used  Is  Gaussian 
elimination  vectorized  across  systems  for  an  average  vector  length  of 
NY.  This  method  has  a  lower  operation  count  than  cyclic  reduction 
(which  was  only  faster  on  the  TIASC  because  of  Its  longer  vectors)  but 
has  a  storage  requirement  of  about  half  a  mesh  for  constants  and  this  Is 
perhaps  twice  that  of  the  TIASC  routine.  Because  of  deficiencies  In  the 
Cyber  203  FORTRAN  vector Izer  the  solver  can  not  be  set  up  as  a 
library.  It  must  be  recompiled  for  each  new  size  with  dimensions 
supplied  via  parameter  statements.  This  Is  not  a  serious  problem  In  an 
ocean  modeling  group  since  mesh  sizes  change  relatively  Infrequently. 

Graph  3  shows  that  the  total  solution  times  are  very  similar  on 
the  TIASC  and  Cyber  203,  although  the  Cyber  203  results  are  for  REAL*8 
calculations  against  REALM  on  the  TIASC.  The  maximum  Cyber  203  REAL*8 
speed  is  about  one  and  a  half  times  that  of  the  TIASC  In  REALM  mode,  so 
the  solver  Is  relatively  more  efficient  on  the  TIASC  as  expected  from 
the  longer  average  vector  lengths  on  this  machine.  The  fastest  Cyber 
203  times  In  scalar  mode  are  also  Included  on  Graph  3,  Illustrating  this 
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machine's  great  speed  as  a  scalar  processor.  The  TIASC  Is  a  relatively 
poor  scalar  processor  and  Its  corresponding  scalar  times  would  not  lie 
on  this  graph. 

The  major  effort  Involved  in  transferring  the  solver  to  the 
Cyber  203  does  not  lie  In  the  modifications*  Indicated  above*  required 
for  best  possible  vector 1z at ion,  but  In  recoding  necessitated  by  a  bug 
In  the  FORTRAN  compiler  which  caused  Incorrect  object  code  to  be 
generated  for  some  of  the  FFT  routines.  The  errors  are  probably  due  to 
the  complicated  subscripting  used  In  these  routines  (which  are  written 
In  totally  standard  ANSI  1966  FORTRAN)  and  It  is  therefore  likely  that 
they  will  recur  In  all  the  other  uniform-uniform  solvers.  However  it 
will  be  simpler  to  correct  the  remaining  eight  sets  of  FFT  routines 
since  the  nature  of  the  error  Is  not  understood.  It  was  the  detection 
and  correction  (by  experimentation)  of  the  errors  which  was  most  time 
consuming  In  this  first  set  of  routines. 


II. 


OCEAN  MODEL  DEVELOPMENT 


NORDA  has  four  types  of  layered  three  dimensional  ocean  models: 
e  Quasi -geos trophic 
e  Rigid-Lid 

e  Explicit  free  surface 
e  Semi -Implicit  free  surface. 

The  semi-implicit  free  surface  codes  are  the  most  frequently 
used  at  NORDA  and  they  are  also  the  most  general,  with  the  capability  of 
handling  non-rectangular  geometry,  prescribed  Inflow  and  outflow  with 
self  determining  profile.  The  explicit  free  surface  codes  have  similar 
capabilities  but  are  limited  by  constraints  on  maximum  time  step  to  a 
small  class  of  problems.  The  rigid-lid  and  qausl-geostrophlc  models 
could  only  (previously)  handle  wind  driven  flows  In  closed  rectangular 
flat  bottomed  basins.  For  this  reason  (among  others)  they  have  found 
little  use  at  NORDA,  although  this  qausl-geostrophlc  model  Is  probably 
the  most  widely  used  layer  type  model  In  the  numerical  ocean  modeling 
comnunlty  outside  NORDA. 

Versions  of  the  following  models  have  been  set  up  for  the 
standard  Irregular  geometry  Gulf  of  Mexico  model  region  with  tubes  at 
Inflow  and  outflow  (Hurlburt  and  Thompson,  1980). 
e  2-layer  rigid- lid  with  flat  bottom, 
e  2-layer  rlgld-T Id  with  bottom  topography, 
e  1- layer  qausl-geostrophlc. 
e  2-layer  qausl-geostrophlc  with  bottom  topography. 

Two  versions  of  the  rigid-lid  model  are  required  because  the 
streamfunctlon  equation  Involves  an  elliptic  equation  more  general  than 
the  usual  Helmholtz  equation  when  bottom  topography  Is  Introduced.  The 
capacitance  matrix  technique  Is  used  to  solve  for  the  streamfunctlon  In 
the  flat  bottomed  model,  but  a  marching  solver  must  be  used  In  cases 
with  bottom  topography.  The  two  layer  quasi -geos trophic  model  allows 
for  bottom  topography,  but  this  must  be  small  scale  If  the  model  Is  to 
be  consistent.  The  rlgld-1 1 d  Inflow  and  outflow  boundary  conditions  are 
very  similar  to  those  used  by  the  free  surface  models.  In  the  quasl- 
geo* trophic  models  the  profile  of  both  the  Inflow  and  the  outflow  Is 
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prescribed;  this  Is  acceptable  In  this  case  because  the  outflow  port  Is 
narrow. 

The  models  have  been  set  up  for  the  Gulf  of  Mexico  model  region 
but  It  Is  relatively  easy  to  change  to  any  other  geometry.  The  quasl- 
geos trophic  models  are  less  flexible  In  this  regard  than  either  the  free 
surface  or  the  rigid-lid  models  because  they  currently  require  the 
outflow  profile  to  be  predefined. 

The  one  layer  semi-implicit  model  has  been  successfully 
transferred  onto  the  Cyber  203.  A  report  has  been  prepared  outlining 
the  steps  to  take  when  transferring  any  model  onto  the  Cyber  203  If  full 
two  dimensional  vectorlzatlon  Is  required  (Wallcraft,  1981). 


Ill 


EQUATORIAL  DYNAMICS 


Introduction 

The  theory  of  solitary  waves  and  their  application  to 
geophysics  has  been  the  focus  of  a  variety  of  recent  studies.  The  works 
of  Clarke  (1971)  on  midlatitude  Rossby  soli tons.  Maxworthy  and  Redekopp 
(1976)  on  Jupiter's  Great  Red  Spot  and  Boyd  (1980)  on  equatorial  Rossby 
solltons  are  just  a  few  examples.  For  extensive  reviews  of  the 
literature,  the  reader  Is  referred  to  Scott  et  al.,  (1973)  and  Miles 
(1980). 

Boyd  (1980)  derived  analytical  expressions  for  Equatorial 
Rossby  solltons  (ERS)  using  as  his  governing  system  the  nonlinear  form 
of  the  shallow  water  wave  equations  for  a  homogeneous  layer  of  fluid. 
His  model  ocean  was  horizontally  unbounded.  The  effects  on  Rossby 
solltons  of  other  equatorial  waves  as  well  as  the  mean  circulation  were 
neglected.  The  system  was  reduced  to  the  solution  of  the  one- 
dimensional  KDV  equation  with  an  Initial  distribution  of  the  height 
field  as  the  forcing. 

The  general  solution  to  this  governing  system,  l.e.,  the 
"prototype"  equatorial  Rossby  sollton,  propagated  across  the  basin  with 
no  change  In  shape  or  amplitude.  In  the  corresponding  linear  solution, 
the  same  Initial  condition  rapidly  evolved  Into  a  wavetrain.  Boyd  found 
that  ERS  results  from  a  variety  of  Initial  conditions;  they  are  the  only 
stable  solution  to  the  Initial  value  problem.  He  speculated  that  these 
solitary  waves  would  most  likely  be  generated  at  the  eastern  boundary 
following  a  sudden  relaxation  of  the  equatorial  trades  such  as  that 
associated  with  El  Nino. 

Although  Boyd's  simple  model  Is  a  valuable  first  step.  It  Is 
Important  to  go  beyond  the  Initial  value  experiments.  In  this  study,  a 
li  layer  reduced- gravity  model  of  the  tropical  Pacific  Is  used  to 
examine  the  generating  mechanisms  for  equatorial  solltons.  In  the  first 
set  of  experiments,  the  model  Is  Initialized  with  the  prototype  sollton 
of  Boyd  In  order  to  demonstrate  the  behavior  of  the  "pure"  equatorial 
Rossby  Solitary  wave.  In  subsequent  experiments,  forcing  Is  applied  as 
a  patch  of  zonal  wind  stress.  It  Is  demonstrated  that  ERS  can  be 
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generated  by  wind  events  In  which  the  equatorial  trades  relax  for 
periods  of  a  few  weeks  to  months.  The  numerical  calculations  also 
support  Boyd's  hypothesis  that  equatorial  solltons  may  be  excited  during 
the  El  Nino  event. 

Initial  Value  Experiments 

The  numerical  model  Is  a  nonlinear,  li  layer  reduced-gravity 
model  of  the  equatorial  Pacific.  The  northern  and  southern  boundaries, 
which  are  open,  are  located  1,500  km  from  the  equator.  The  basin  Is 
rectangular  In  shape  and  extends  15,000  km  zonally.  For  the  experiments 
described  below,  the  Initial  upper  layer  thickness  Is  150  m  and  the 
stratification,  Ap  ,  Is  3ot  units.  A  more  detailed  description  of  the 
model  can  be  found  In  Kindle  (1979,  1981). 

The  model  Is  Initialized  with  Boyd's  solution  for  the  first 
latitudinal  mode  equatorial  Rossby  sollton  (Fig.  1).  Case  1  Is  the 
linear  solution.  The  effects  of  dispersion  cause  the  Initial  feature  to 
evolve  Into  a  wavetraln  (Fig,  2a).  In  Case  2  the  nonlinear  terms  are 
Included;  Fig.  2b  reveals  that  the  Initial  disturbance  propagates  across 
the  basin  with  very  little  change  In  shape  or  amplitude.  Clearly,  the 
nonlinear  terms  balance  the  tendency  to  disperse  thus  demonstrating  that 
the  model  reproduces  the  behavior  of  the  equatorial  Rossby  sollton 
derived  analytically  by  Boyd. 

Wind  Generation  of  Equatorial  Solltons 

In  the  experiments  described  below,  wind  stress  forcing  Is 
applied  to  the  model  ocean  In  order  to  test  whether  solltons  can  be 
excited  by  wind  events.  The  forcing  Is  In  the  form  of  a  patch  of  zonal 
wind  stress  with  a  uniform  meridional  distribution  and  a  top-hat  zonal 
shape.  The  wind  Is  directed  from  west  to  east  and  represents  a 
relaxation  of  the  mean  equatorial  trades.  As  described  by  McCreary 
(1977)  and  Kindle  (1979),  the  wind  patch  radiates  packets  of  Kelvin  and 
Rossby  waves.  The  Kelvin  waves,  which  have  a  much  larger  amplitude  than 
the  Rossby  waves,  propagate  a  downwelllng  pulse  toward  the  eastern 
boundary.  The  reflection  of  this  downwelllng  pulse  Initiates  westward 
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Figure  1: 


(la)  1.5 


10*  KM 


-1.5 


EQUATOR 


10*  KM 


15 


U*>)  OflY  '  0 


Initial  condition  for  Cases  1  and  2  -  the  prototype  Rossby 
sollton  derived  by  Boyd,  (a)  Zonal  velocity  component.  The 
contour  Interval  Is  5  cm/sec.  (b)  Model  pycnocllne  surface  as 
viewed  from  the  southwest  Pacific.  The  maximum  amplitude  of 
the  Initial  depression  Is  25  meters. 
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propagating  Rossby  waves.  The  subsequent  description  focuses  on  the 
response  of  the  first  latitudinal  mode  Rossby  wave  created  by  the 
eastern  boundary  reflection. 

In  Case  3,  the  wind  stress  has  a  zonal  width  scale  of  2,000  km 
and  an  amplitude  of  .5  dynes/cm2.  The  eastern  edge  of  the  patch  Is 
located  6,500  km  from  the  eastern  boundary.  The  forcing  Is  applied 
Impulsively  and  switched  off  at  Day  30.  Fig.  3  presents  the  results  of 
the  linear  calculation.  As  the  first  mode  Rossby  wave  propagates 
westward.  It  evolves  Into  a  wavetraln.  Although  not  as  pronounced  as  In 
Case  1,  the  effects  of  dispersion  are  very  evident.  The  nonlinear 
counterpart  to  this  experiment  (Case  4)  Is  shown  In  Fig.  4.  The  first 
mode  Rossby  wave  propagates  across  the  basin  and  exhibits  only  slight 
changes  In  Its  shape  and  amplitude.  The  differences  between  the  linear 
and  nonlinear  experiments  and  the  similarity  between  this  solitary  wave 
and  the  prototype  sollton  lead  to  the  conclusion  that  this  wind  event 
has  generated  an  equatorial  sollton. 

Although  not  shown.  It  was  found  that  wind  events  with  time 
scales  from  two  weeks  to  one  month  generate  solltons  similar  to  Case 
4.  This  Is  true  even  If  the  amplitude  of  the  forcing  Is  only  .25 
dynes/cm2.  Events  with  time  scales  greater  than  one  month,  however,  can 
generate  multiple  Rossby  solltons,  the  number  of  which  depends  upon  the 
product  of  the  time  scale,  zonal  width  scale  and  amplitude  of  the  wind 
event.  Simple  El  Nino  simulations  suggest  that  this  Intense  event  would 
very  likely  excite  multiple  solltons  subsequent  to  the  reflection  of  the 
Kelvin  wave  front  from  the  eastern  boundary. 

Summary 


^Numerical  experiments  were  performed  In  order  to  examine  the 
generating  mechanisms  for  equatorial  Rossby  solltons.  It  was  shown  that 
wind  events  representing  a  relaxation  of  the  equatorial  trades  can 
generate  these  solitary  waves  subsequent  to  the  reflection  of  the  Kelvin 
wave  from  the  eastern  boundary.  If  the  duration  of  the  event  Is  greater 
than  one  month.  It  Is  possible  for  multiple  solltons  to  form.  This  Is 
particularly  true  for  large  scale  events  like  El  Nino.  The  numerical 
calculations  support  the  hypothesis  of  Boyd  (1980)  that  solltons  would 
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Figure  3:  (a)  The  zonal  velocity  component  at  Day  125  for  Case  3  (linear 

solution).  The  first  mode  Rossby  wave  resulting  from  the 
Kelvin  wave  reflection  is  located  4000  km  from  the  eastern 
boundary.  The  contour  Interval  Is  5  cm/sec.  (b)  Sane  as  (a) 
except  at  Day  225. 


10*  KM 

Figure  4:  (a)  Same  as  Fig.  3  except  for  Case  4  (nonlinear  solution),  (b) 

Zonal  velocity  at  Day  200  for  Case  4.  Equatorial  sollton  is 
clearly  evident  at  x  ■  5000  km. 


most  likely  be  generated  during  El  Nino  events.  These  results  also 
suggest  that  even  wind  events  on  a  smaller  scale,  both  temporally  and 
spatially,  can  generate  Rossby  solitons  and  that  these  waves  may  have  a 
significant  influence  on  the  variability  of  the  equatorial  thermocline. 
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